DRuKoLA - Towards Contrastive German-Romanian Research based on Comparable Corpora

Ruxandra Cosma (1), Dan Cristea (2,3), Marc Kupietz (4), Dan Tufiş (5), Andreas Witt (4,6)

(1) University of Bucharest, Faculty of Foreign Languages
(2) "Alexandru Ioan Cuza" University of Iaşi, Department of Computer Science
(3) Romanian Academy, Institute for Computer Science - Iaşi
(4) Institut für Deutsche Sprache, Mannheim
(5) Institute for Artificial Intelligence "Mihai Drăgănescu", Bucharest
(6) Heidelberg University, Department of Computational Linguistics

ruxandracosma@gmail.com, dcristea@info.uaic.ro, {kupietz,witt}@ids-mannheim.de, tufis@racai.ro

Abstract

This paper introduces the recently started DRuKoLA project, which aims at providing mechanisms to flexibly draw virtual comparable corpora from the German Reference Corpus DeReKo and the Reference Corpus of Contemporary Romanian Language CoRoLa, in order to use these virtual corpora as an empirical basis for contrastive linguistic research.

Keywords: Reference Corpora, Comparable Corpora, Contrastive Linguistics

1 Introduction

Corpora have increasingly been used in cross-linguistic research, where, in particular, parallel corpora have been of major importance. The usefulness of parallel resources for cross-linguistic research is obvious, as they provide bi- or multilingual, ideally aligned language data that convey the same meaning, including contextual information, and can thus serve as a basis for establishing equivalence between particular entities across different languages (cf. James 1980, Chesterman 1998). On this account, parallel data have been used as an empirical basis in many contrastive studies so far. Some examples include Altenberg (1999), Hasselgård (2007), Zufferey and Cartoni (2012), Kaczmarska and Rosen (2013), where various phenomena from English and Swedish, English, Swedish and Norwegian, English and French, Polish and Czech, respectively, have been accounted for.

Recently, there has also been growing interest in developing comparable corpora (see Sharoff et al. 2013 and the workshop series Building and Using Comparable Corpora), but so far no comparable resources are available (at least not for German and Romanian) that would allow us to conduct cross-linguistic investigations drawing on language-specific grammatical and semantic properties. The aim of the DRuKoLA project, as will be sketched in this paper, is to see whether a common building strategy can be used for a pair of reference corpora belonging to languages of two different families, whether a common view on the management of the two corpora can be adopted, and whether access to them can be organised with a common corpus analysis platform. Moreover, the project will investigate how comparable virtual collections (sub-corpora) can be extracted dynamically from this shared resource and how they can serve as flexible, cost-efficient and high-quality empirical bases for answering comparative linguistic research questions.

2 Aims of the DRuKoLA project

The DRuKoLA project [1], which is centered around the German Reference Corpus DeReKo (Kupietz et al., 2010) and the Reference Corpus of Contemporary Romanian Language CoRoLa (Tufiş et al., 2015), started in January 2016 and is a cooperation between the University of Bucharest, the Institute for the German Language in Mannheim, and research institutes of the Romanian Academy in Bucharest and Iaşi. DRuKoLA is a transdisciplinary project involving corpus linguistics, computational linguistics, applied linguistics and cross-linguistic studies, applied computer science, corpus architecture and, finally, research infrastructure development. Within this broad range of areas, DRuKoLA's concrete research objectives are:

1. Construction, provision and harmonization of comparable corpora in the two languages.
2. Development of criteria for building comparable virtual sub-corpora based on DeReKo and CoRoLa, the German and the Romanian corpus respectively, based on metadata and other possible text properties.
3. Exploration of language-specific peculiarities of the studied languages and of equivalences with respect to different parameters and structures.
4. Corpus-based comparative case studies on a) markers of modality: haben/a avea with zu-infinitives and supine, b) (abstract) demonstratives in German and Romanian, and c) general investigation of distributional semantic and syntagmatic properties of corresponding forms and structures.
5. Development of corpus technology to share the corpus, technical and research results in a common corpus analysis platform.
6. Building a structure that can serve as a crystallization point for other national or reference corpora, with the long-term goal of building a federated, at least European, reference corpus where each corpus is still physically located at and curated by its responsible institute, but can be dynamically extracted to different comparable corpora.

[1] DRuKoLA is funded by the Alexander von Humboldt-Foundation as a Research Group Linkage Programme between the University of Bucharest and the Institute for the German Language in Mannheim, with the Institute for Artificial Intelligence "Mihai Drăgănescu" (RACAI, Bucharest) and the Institute of Computer Science (IIT, Iaşi) of the Romanian Academy as associated partners. The acronym combines central goals of the project: corpus development and contrastive linguistic analysis (Sprachvergleich korpustechnologisch Deutsch - Rumänisch).

We should also mention that at least objectives 2 to 5 are planned to be carried out in parallel and in a cyclic bootstrapping fashion. That means, for example, that the initial naive definition of the comparable corpora and the analysis and visualization functions of the query software will be iteratively refined based on the results of the linguistic analyses. As a welcome side-effect of this procedure, we expect to acquire a good impression of the extent to which the linguistic results vary with different corpus compositions, and thereby an impression of the reliability and generalizability of the obtained findings.

While research objective (6) is also a long-term goal, we already expect numerous synergy effects within the scope of the current project. First of all, we are convinced that joining national or reference corpora virtually, with each institute still being responsible for the curation and extension of its own resources, is a much more economical and sustainable approach than building multiple comparable corpora from scratch and maintaining them on a project basis. Another aspect concerns the development and maintenance of sustainable research software, which is currently carried out individually for each reference corpus. A closer collaboration in this field has the potential of reducing the investments in infrastructure, which are always difficult in the academic context, to a fraction. In addition to such mostly economic arguments, we are convinced that bringing together the (corpus-)linguistic communities of different languages, which are currently still too much centered around their philologies, has a large boost potential of its own.

3 The underlying corpus resources

Starting a project like this, situated at very different moments of corpus development and architecture, is a rare opportunity: on the one hand, we are working on and witnessing the construction of the Romanian Contemporary Reference Corpus from its beginnings and, on the other hand, we are working with a very advanced German reference corpus, analysis system and technology. The collection of data for German started more than 50 years ago, and the exploration of principles and methods of empirically anchoring linguistic studies began at the IDS at the beginning of the nineties. The CoRoLa project started only in 2014 as a project of national priority of the Romanian Academy. The corpus is growing rapidly, as the work is being carried out simultaneously at two institutes of computer science, in Bucharest and in Iaşi.

3.1 DeReKo

The German Reference Corpus DeReKo (Deutsches Referenzkorpus) has been developed at the IDS since its inception in 1964. With more than 25 billion words (Kupietz and Lüngen, 2014), it is the world's largest collection of German texts. In contrast to other reference or national corpora, DeReKo is not designed to be used as a monolithic corpus. Instead, it adopts a primordial-sample design approach (Kupietz et al., 2010), which invites users to create stratified sub-samples (referred to as virtual corpora or virtual collections), custom-tailored to their respective research questions and basic populations. Such an approach effectively allows for maximization of its size, diversity and applicability for different research questions (Kupietz et al., 2014) and is also fundamental for the definition of different virtual comparable corpora in the DRuKoLA context. DeReKo provides a broad variety of text types with a quantitative focus on newspaper texts and a rapidly growing portion of computer-mediated communication (cf. Beißwenger et al., 2015; Margaretha and Lüngen, 2014; Schröck and Lüngen, 2015). DeReKo is endowed with rich metadata (Klosa et al., 2012; Kupietz and Keibel, 2009), multiply annotated on the part-of-speech, dependency and constituency levels (Belica et al., 2011) and sufficiently licensed to be queried and analyzed for non-commercial linguistic purposes (QAO-NC license, see Kupietz and Lüngen, 2014).

3.2 CoRoLa

Currently, CoRoLa contains more than 191 million word forms of written text and about 135 hours of transcribed speech (Tufiş et al., 2016). In its first public version, CoRoLa will contain more than 500 million word forms and more than 300 hours of transcribed speech (approximately 3 million words), and it will be IPR (Intellectual Property Rights) cleared. It aims at being representative for the literary language. The corpus covers the following 35 subdomains: literature, politics, gossip, film, music, economy, health, linguistics, theatre, painting/drawing, law, sport, education, history, religious studies and theology, medicine, technology, chemistry, entertainment, environment, architecture, engineering, pharmacology, art history, administration, enology, pedagogy, philology, juridical sciences, biology, social, mathematics, social events, philosophy, other [2]. The classification into domains and sub-domains is based on that of Wikipedia. The functional styles considered are: journalistic, scientific, imaginative, memorialistic, administrative, juridic and other (see footnote 2). CoRoLa uses realisation conventions similar to those of the Romanian Balanced Corpus (ROMBAC) [3] (Ion et al., 2012), which contains over 44 million tokens from five domains (news, medical, legal, biographic and fiction). The creators of CoRoLa pay special attention to obtaining the consent of owners before including their texts in the corpus; thus, protocols of collaboration have been signed with a number of publishing houses, editorial offices, and radio channels.

[2] This is a category for all documents that could not be definitely classified into the named categories.
[3] http://www.meta-net.eu/meta-share

In line with the current diversification of language and speech information available in modern representative corpora, CoRoLa will include a syntactically annotated sub-corpus and an oral component. All textual data is morpho-lexically processed (tokenized, POS-tagged and lemmatized). The current annotations are provided in-line but, in the future, as different layers of linguistic annotations (noun phrases, dependency parses, named entities, semantic relations, discourse structures etc.) will be provided for the same data, a mixed (in-line and standoff) annotation will be used. The Universal Dependency (UD) [4] compliant treebank (targeted: more than 10 000 hand-validated sentences) and the oral component have additional annotations (dependency links for the former; speech segmentation at sentence level, pauses, non-lexical sounds such as breath, cough, laugh and sneeze, and partial explicit marking of the accent for the latter).

[4] http://universaldependencies.github.io/docs/

The metadata annotators (many of whom are volunteers) work under the guidance of a detailed Annotation Manual. The work on building CoRoLa, which started two years before the initiation of DRuKoLA, has so far been technically supported by an online platform (developed at IIT-Iaşi), which includes facilities for cleaning formatting, standardising Romanian diacritics, eliminating hyphenation, visualizing statistics about the quantity of texts accumulated and their subdomains, and filling in metadata. However, many cleaning phases are still done manually: separating articles from periodicals into different files, removal of headers, page numbers, figures, tables, text fragments in foreign languages and excerpts from other authors, and annotation of footers and endnotes (which it was decided to leave in the texts).

3.3 Harmonization of DeReKo and CoRoLa

Both CoRoLa and DeReKo metadata comply with the CMDI [5] (Component MetaData Infrastructure) and/or TEI-P5 [6] standards. For the construction of comparable corpora, however, semantic interoperability has to be achieved in addition to this mainly syntactic interoperability, for example for the metadata categories that are used for the construction of virtual corpora. The general procedure for the harmonization of data categories and value sets will be to define functions that map the respective original data to more coarse-grained taxonomies. Additional harmonization work will also be required on lower levels, e.g. for the integration of CoRoLa into the KorAP corpus query engine, or for the adoption of the GGS query mechanism developed for CoRoLa as an auxiliary search engine to express constraints that would exploit the multi-layered annotation of DeReKo, both mentioned in the following section. The first DRuKoLA workshop [7] is expected to answer many of these questions.

[5] http://www.clarin.eu/content/component-metadata
[6] http://www.tei-c.org/Guidelines/P5/
[7] The workshop takes place in April this year in Bucharest.

4 Query and analysis software

The software that will be used for conducting the corpus linguistic research within DRuKoLA and for making the project results available to the community is the corpus query and analysis platform KorAP, which has recently been developed at the IDS (Bański et al., 2013; 2014). KorAP is the designated successor of the corpus search and management system COSMAS, which was launched in 1994 and, in its second incarnation (COSMAS II), is currently used by 39 000 German linguists. Besides KorAP's more performance-oriented features, like horizontal scalability with respect to an unbounded corpus size and any number of annotation layers, two features are particularly fundamental for DRuKoLA: (1) its ability to manage corpora that are physically located at different places, in order to comply with typical license restrictions (cf. Kupietz et al., 2014), and (2) its ability to dynamically create virtual sub-corpora based on text properties and to manage these virtual corpora in a persistent way, e.g. to allow for reusability and reproducibility. Further features that will be required for the rather mono-linguistic research purposes will be integrated from recent and ongoing developments of the project partners, for example the interactive overview visualizations of corpus compositions (Perkuhn and Kupietz, forthcoming), or the visualisation of query expressions as graphs, allowed by the GGS mechanism (Simionescu, 2012). GGS (Graphical Grammar Studio) is an open-source platform that allows the interactive writing of grammars which annotate sequences of XML elements at any level, and it has recently been augmented with a constraint-based search mechanism (Simionescu, forthcoming). Functionalities specifically required for the contrastive research tasks will first be inventoried and then developed during the project.

5 Corpus based contrastive case studies

Based on recent or current research interests of the participating linguists in definite DPs in Romanian (Cornilescu and Nicolae, 2011a; 2011b), situational use of demonstratives (Cosma and Engelberg, 2014) or particularities of the Romanian verbal supine form (Cornilescu and Cosma, 2013; 2014), the project primarily supports, as part of the harmonization process, the creation and adaptation of analysis instruments for Romanian. The testing phase of the developed instruments will then serve data-based linguistic research and will help identify linguistic variation and preferences within selected research topics: the modality markers haben-zu infinitives in German and their equivalent finite and nonfinite forms (are de V-ut, are V_infinitive, are să V_subj) in Romanian, demonstratives in different uses and positions, reinforcement patterns of demonstratives through adverbs as in dieser hier, dieses schöne Auto da, propositional reference of demonstratives (das, asta) etc. Possible aspects to be syntactically explored therefore include: i) distributional patterns of haben-zu infinitives with haben as a raising verb, and the distribution of the equivalent form variants of Romanian are de/a/să + V; ii) identifying structural and stylistic factors in the use of one of the three equivalent forms of the haben-zu infinitive in Romanian; iii) the use of the propositional demonstrative das and of the abstract demonstratives asta/astea in Romanian, which are differentiated for singular and plural, etc. For the exploration and analysis of distributional semantic and syntagmatic properties we will use collocation profiles (Belica et al., 2010; Belica, 2011) as well as word embeddings (Mikolov et al., 2013; Ling et al., 2015).

6 Conclusions

We have presented in this paper a very young German-Romanian project, intended to harmonize methods and tools for building and exploiting corpora in these two languages. The idea of the project is to apply a long-standing tradition in the creation of corpora to a newly-born one. At one pole of this project there is the experience gained by the IDS Mannheim in the creation of DeReKo, the largest German language corpus. Two years before this project was initiated, the work on the Contemporary Romanian Language Corpus was started simultaneously in Bucharest and Iaşi. The experience gathered in this period (finding providers of texts and vocal recordings, agreeing on the metadata to be used, designing and building an interactive platform that helps to clean the linguistic data and fill in metadata, and designing an access mechanism) will now have to be harmonised with the already running German machine. Whether or not one common methodology will be applicable to both corpora, comparable conventions will have to be fixed through an updating process. This will not only make possible extremely interesting contrastive studies of the two languages and produce a very large comparative bilingual corpus (with interesting possible benefits for MT technology); the lessons learned from this enterprise could also be extended to the European level, to prepare the stage for a multilingual unification of corpora, methodologically and technologically, with tremendous beneficial effects for multilingual language research.

7 References

Altenberg, B. (1999). Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In: Hasselgård and Oksefjell (eds.): Out of Corpora. Amsterdam: Rodopi, 249-268.

Bański, P., Bingel, J., Diewald, N., Frick, E., Hanl, M., Kupietz, M., Pęzik, P., Schnober, C. and Witt, A. (2013). KorAP: the new corpus analysis platform at IDS Mannheim. Presented at the 6th Conference on Language and Technology (LTC-2013), Poznań, Poland, December 2013.

Bański, P., Diewald, N., Hanl, M., Kupietz, M. and Witt, A. (2014). Access Control by Query Rewriting: the Case of KorAP. In: Proceedings of the 9th conference on the Language Resources and Evaluation Conference (LREC 2014), European Language Resources Association (ELRA), Reykjavik, Iceland, May 2014, 3817-3822.

Beißwenger, M., Ehrhardt, E., Horbach, A., Lüngen, H., Steffen, D. and Storrer, A. (2015). Adding Value to CMC Corpora: CLARINification and Part-of-speech Annotation of the Dortmund Chat Corpus. In: Beißwenger, M. and Zesch, T. (eds.): NLP4CMC 2015. 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media. Proceedings of the Workshop, September 29, 2015, University of Duisburg-Essen, Campus Essen. German Society for Computational Linguistics & Language Technology (GSCL), 2015, 12-16.

Belica, C., Keibel, H., Kupietz, M. and Perkuhn, R. (2010). An empiricist's view of the ontology of lexical-semantic relations. In: Storjohann, P. (ed.): Lexical-Semantic Relations. Theoretical and practical perspectives. John Benjamins Publishing Company, 115-144.

Belica, C. (2011). Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen. In: Abel, A., Zanin, R. (eds.): Korpora in Lehre und Forschung. Bozen-Bolzano University Press, Freie Universität Bozen-Bolzano, 155-178.

Belica, C., Kupietz, M., Witt, A. and Lüngen, H. (2011). The Morphosyntactic Annotation of DeReKo: Interpretation, Opportunities, and Pitfalls. In: Konopka, M., Kubczak, J., Mair, C., Šticha, F., Waßner, U. (eds.): Grammar and Corpora 2009. Third international conference. Tübingen: Narr, 451-469.

Chesterman, A. (1998). Contrastive Functional Analysis. Amsterdam/Philadelphia: John Benjamins Publishing Company.

Cornilescu, A., Nicolae, A. (2011a). Nominal Peripheries and Phase Structure in the Romanian DP. In: Revue Roumaine de Linguistique LVI, 1, 35-68.

Cornilescu, A., Nicolae, A. (2011b). On the Syntax of the Romanian Definite Phrases: Changes in the Patterns of Definiteness Checking. In: Sleeman, P., Perridon, H. (eds.): The Noun Phrase in Romance and Germanic. Structure, Variation and Change. Amsterdam: John Benjamins, 193-222.

Cornilescu, A., Cosma, R. (2013). Restructuring strategies as means of providing increased referentiality for the internal argument of the de-supine clause. In: Bucharest Working Papers in Linguistics vol. XV 2, 91-121.

Cornilescu, A., Cosma, R. (2014). On the functional structure of the Romanian de-supine. In: Cosma, R., Engelberg, S., Schlotthauer, S., Stănescu, S., Zifonun, G. (eds.): Komplexe Argumentstrukturen. Kontrastive Untersuchungen zum Deutschen, Rumänischen und Englischen. Berlin/München/Boston: de Gruyter [Konvergenz und Divergenz 3], 283-335.

Cosma, R., Engelberg, S. (2014). Subjektsätze als alternative Argumentrealisierungen im Deutschen und Rumänischen. Eine kontrastive quantitative Korpusstudie zu Psych-Verben. In: Cosma, R., Engelberg, S., Schlotthauer, S., Stănescu, S., Zifonun, G. (eds.): Komplexe Argumentstrukturen. Kontrastive Untersuchungen zum Deutschen, Rumänischen und Englischen. Berlin/München/Boston: de Gruyter, 339-420.

Ion, R., Irimia, E., Ştefănescu, D. and Tufiş, D. (2012). ROMBAC: The Romanian Balanced Annotated Corpus. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the 8th LREC, 339-344.

Johansson, S. (1999). Corpora and contrastive studies. In: Pietilä, P. and Salo, O.-P. (eds.): Multiple Languages – Multiple Perspectives. AFinLA Yearbook 1999 / No. 57, 116-125.

Kaczmarska, E., Rosen, A. (2013). Między znaczeniem leksykalnym a walencją – próba opracowania metody ekstrakcji ekwiwalentów na podstawie korpusu równoległego. Studia z Filologii Polskiej i Słowiańskiej, 48: 103-121. Warszawa.

Klosa, A., Kupietz, M. and Lüngen, H. (2012). Zum Nutzen von Korpusauszeichnungen für die Lexikographie. In: Lexicographica 28. Berlin/Boston: de Gruyter, 71-97.

Kupietz, M. and Keibel, H. (2009). The Mannheim German Reference Corpus (DeReKo) as a basis for empirical linguistic research. In: Minegishi, Makoto / Kawaguchi, Yuji (eds.): Working Papers in Corpus-based Linguistics and Language Education, No. 3. Tokyo: Tokyo University of Foreign Studies, 53-59.

Kupietz, M., Belica, C., Keibel, H. and Witt, A. (2010). The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of LREC 2010, 1848-1854.

Kupietz, M., Lüngen, H. (2014). Recent Developments in DeReKo. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik: ELRA, 2378-2385.

Kupietz, M., Lüngen, H., Bański, P. and Belica, C. (2014). Maximizing the Potential of Very Large Corpora. In: Kupietz, M., Biber, H., Lüngen, H., Bański, P., Breiteneder, E., Mörth, K., Witt, A., Takhsha, J. (eds.): Proceedings of the LREC-2014-Workshop Challenges in the Management of Large Corpora (CMLC2). Reykjavik: ELRA, 1-6.

Ling, W., Dyer, C., Black, A. and Trancoso, I. (2015). Two/Too Simple Adaptations of word2vec for Syntax Problems. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, CO: ACL.

Margaretha, E., Lüngen, H. (2014). Building Linguistic Corpora from Wikipedia Articles and Discussions. In: Beißwenger, M., Oostdijk, N., Storrer, A., van den Heuvel, H. (eds.): Journal for Language Technology and Computational Linguistics (JLCL) 29 (2). Special Issue on Building and Annotating Corpora of Computer-mediated Communication: Issues and Challenges at the Interface between Computational and Corpus Linguistics. Regensburg: GSCL, 2014, 59-82.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111-3119.

Perkuhn, R., Kupietz, M. (forthcoming). Visualisierung als erkenntnisleitendes Instrument. In: Bubenhofer, N. and Kupietz, M.: Proceedings of the Herrenhausen-Symposium on Visual Linguistics 2014.

Schröck, J., Lüngen, H. (2015). Building and Annotating a Corpus of German-Language Newsgroups. In: Beißwenger, M., Zesch, T. (eds.): NLP4CMC 2015. 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media. Proceedings of the Workshop, September 29, 2015, University of Duisburg-Essen, Campus Essen. German Society for Computational Linguistics & Language Technology (GSCL), 2015, 17-22.

Sharoff, S., Rapp, R., Zweigenbaum, P. and Fung, P. (eds.) (2013). Building and Using Comparable Corpora. Springer.

Simionescu, R. (2012). Romanian Deep Noun Phrase Chunking Using Graphical Grammar Studio. In: Proceedings of the Conference "Linguistic Resources and Instruments for Romanian Language - ConsILR-2011", Bucharest, "Alexandru Ioan Cuza" University of Iaşi Editing House, 135-143.

Simionescu, R. (forthcoming). Symbolic Mechanisms for Describing Linguistic Constraints. Ph.D. Thesis, "Alexandru Ioan Cuza" University of Iaşi.

Tufiş, D., Barbu Mititelu, V., Irimia, E., Dumitrescu, Ş. D., Boroş, T., Teodorescu, N. H., Cristea, D., Scutelnicu, A., Bolea, C., Moruz, A. and Pistol, L. (2015). CoRoLa Starts Blooming – An Update on the Reference Corpus of Contemporary Romanian Language. In: Proceedings of the 3rd Workshop on Challenges in the Management of Large Corpora (CMLC-3), 5-10.

Tufiş, D., Barbu Mititelu, V., Irimia, E., Dumitrescu, Ş. D., Boroş, T. (2016). The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portoroz, Slovenia.

Wälchli, B. (2007). Advantages and disadvantages of using parallel texts in typological investigations. In: Sprachtypologie und Universalienforschung 60:2, 118-134.
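
As an illustration of the harmonization procedure outlined in Section 3.3 (functions that map corpus-specific metadata values to a more coarse-grained taxonomy) and of the dynamic definition of virtual sub-corpora described in Section 4, the following minimal Python sketch shows one way the idea could be expressed. It is not the project's implementation and does not use KorAP's API. The record schema, the function names, the coarse classes ("press", "fiction", "science", "other") and the German category labels are all assumptions made for this example; only the CoRoLa style labels are taken from Section 3.2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TextRecord:
    """Minimal stand-in for one corpus text and its metadata (hypothetical schema)."""
    text_id: str
    corpus: str      # "DeReKo" or "CoRoLa"
    category: str    # original, corpus-specific metadata value
    tokens: int


def make_mapper(table: Dict[str, str], default: str = "other") -> Callable[[str], str]:
    """Build a function that maps an original value to a coarse-grained class."""
    def mapper(value: str) -> str:
        # Values without an explicit mapping fall back to the default class.
        return table.get(value, default)
    return mapper


# One mapping function per corpus. The CoRoLa keys are functional styles named
# in Section 3.2; the DeReKo keys and all target classes are invented placeholders.
MAPPERS: Dict[str, Callable[[str], str]] = {
    "DeReKo": make_mapper({"Zeitung": "press", "Belletristik": "fiction",
                           "Fachtext": "science"}),
    "CoRoLa": make_mapper({"journalistic": "press", "imaginative": "fiction",
                           "scientific": "science"}),
}


def harmonize(record: TextRecord) -> str:
    """Return the coarse-grained class of a text, whatever its source corpus."""
    return MAPPERS[record.corpus](record.category)


def draw_virtual_corpora(records: List[TextRecord],
                         coarse_class: str) -> Dict[str, List[TextRecord]]:
    """Select, per corpus, all texts whose harmonized class equals coarse_class."""
    selection: Dict[str, List[TextRecord]] = {name: [] for name in MAPPERS}
    for record in records:
        if harmonize(record) == coarse_class:
            selection[record.corpus].append(record)
    return selection


def token_totals(selection: Dict[str, List[TextRecord]]) -> Dict[str, int]:
    """Sum token counts per corpus, e.g. to check how balanced the pair is."""
    return {name: sum(r.tokens for r in recs) for name, recs in selection.items()}


# Invented sample records, for illustration only.
SAMPLE = [
    TextRecord("de-1", "DeReKo", "Zeitung", 1200),
    TextRecord("de-2", "DeReKo", "Belletristik", 800),
    TextRecord("de-3", "DeReKo", "Chat", 300),
    TextRecord("ro-1", "CoRoLa", "journalistic", 900),
    TextRecord("ro-2", "CoRoLa", "juridic", 400),
    TextRecord("ro-3", "CoRoLa", "journalistic", 500),
]

if __name__ == "__main__":
    press = draw_virtual_corpora(SAMPLE, "press")
    print({name: [r.text_id for r in recs] for name, recs in press.items()})
    print(token_totals(press))
```

Keeping one mapping function per corpus leaves the original metadata untouched, so the coarse taxonomy can be refined iteratively, in line with the bootstrapping procedure described in Section 2, without re-annotating either corpus.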